fix: add retry logic for one time instruction #226

pratikjagrut · 2025-03-25T16:50:11Z

This PR addresses issue rancher/rancher#48916 by implementing retry logic for one-time instructions.

Changes

Added retry logic using retry.OnError for one-time instruction execution

Testing

Manually tested.
Logs:

[email protected]:/home/ubuntu# journalctl -fu rancher-system-agent.service
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Rancher System Agent version 15b7568 (15b7568) is starting"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Starting remote watch of plans"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Initial connection to Kubernetes cluster failed with error Get \"https://69c34c5e4eeb.ngrok.app/version\": tls: failed to verify certificate: x509: certificate signed by unknown authority, removing CA data and trying again"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Starting /v1, Kind=Secret controller"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Detected first start, force-applying one-time instruction set"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="[Applyinator] Applying one-time instructions for plan with checksum 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="[Applyinator] Extracting image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1 to directory /var/lib/rancher/agent/work/20250326-231340/0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749_0"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
rancher-system-agent[82604]: time="2025-03-26T23:13:40Z" level=info msg="Pulling image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=warning msg="Failed to get image from endpoint: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout"
rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=error msg="error while staging: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"

rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=error msg="[Applyinator] Execution failed: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=info msg="[Applyinator] (attempt #1) Applying one-time instructions for plan with checksum 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:14:41Z" level=info msg="[Applyinator] (attempt #1) Retrying instruction 0 for plan 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:14:44Z" level=info msg="[Applyinator] Extracting image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1 to directory /var/lib/rancher/agent/work/20250326-231340/0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749_0"
rancher-system-agent[82604]: time="2025-03-26T23:14:44Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
rancher-system-agent[82604]: time="2025-03-26T23:14:44Z" level=info msg="Pulling image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=warning msg="Failed to get image from endpoint: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout"
rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=error msg="error while staging: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"

rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=error msg="[Applyinator] Execution failed: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=info msg="[Applyinator] (attempt #2) Applying one-time instructions for plan with checksum 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:15:14Z" level=info msg="[Applyinator] (attempt #2) Retrying instruction 0 for plan 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:15:20Z" level=info msg="[Applyinator] Extracting image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1 to directory /var/lib/rancher/agent/work/20250326-231340/0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749_0"
rancher-system-agent[82604]: time="2025-03-26T23:15:20Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
rancher-system-agent[82604]: time="2025-03-26T23:15:20Z" level=info msg="Pulling image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=warning msg="Failed to get image from endpoint: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout"
rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=error msg="error while staging: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"

rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=error msg="[Applyinator] Execution failed: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=info msg="[Applyinator] (attempt #3) Applying one-time instructions for plan with checksum 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:15:50Z" level=info msg="[Applyinator] (attempt #3) Retrying instruction 0 for plan 0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749"
rancher-system-agent[82604]: time="2025-03-26T23:16:03Z" level=info msg="[Applyinator] Extracting image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1 to directory /var/lib/rancher/agent/work/20250326-231340/0395dd07720d0f5717be20792c5b0aa2b5d09ca0d60592eb90b432a26f534749_0"
rancher-system-agent[82604]: time="2025-03-26T23:16:03Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
rancher-system-agent[82604]: time="2025-03-26T23:16:03Z" level=info msg="Pulling image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"
rancher-system-agent[82604]: time="2025-03-26T23:17:33Z" level=warning msg="Failed to get image from endpoint: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout"
rancher-system-agent[82604]: time="2025-03-26T23:17:33Z" level=error msg="error while staging: all endpoints failed: Get \"https://privateRegistry.com:5000/v2/\": dial tcp y.y.y.y:5000: i/o timeout: failed to get image privateRegistry.com:5000/rancher/system-agent-installer-rke2:v1.31.6-rke2r1"

HarrisonWAffel

I'm not sure if this is the right approach. If an instruction fails I would still expect the failure count to be updated in the secret before another attempt is made.

Have we looked into potentially resetting the appliedChecksumKey field in the secret after the system agent restarts and reattempts the instruction? It may be that this needs to be the general behavior, as opposed to plan/Windows specific (which it currently is IIRC).

pratikjagrut · 2025-04-07T08:36:16Z

@HarrisonWAffel, I made the changes per your suggestion. Can you please take a look?

jakefhyde

Non-functional change request: discussion on edge cases.

jakefhyde · 2025-04-07T20:16:28Z

pkg/k8splan/watcher.go

@@ -215,10 +215,10 @@ func (w *watcher) start(ctx context.Context, strictVerify bool) {
 				logrus.Infof("Detected first start, force-applying one-time instruction set")
 				needsApplied = true
 				hasRunOnce = true
+				secret.Data[appliedChecksumKey] = []byte("")


@HarrisonWAffel I believe this will work, but I think we need to address ResetFailureCountOnStartup if we are changing the default behavior. Do you think we would be fine removing it from the plan and then removing it from Rancher? What about re-running failed plans at startup? Asking because these seem like mostly windows cases so I need a refresher on the context (I remember there is some interesting behavior regarding potentially cyclic/competing services).

There is also the fact that this introduces a change to the system-agent-upgrader plan which will re-run the latest plan during the upgrade, not a bug but definitely worth noting, since plans may be running during agent upgrades (already a possibility now, but much more likely something shows up in the UI).

In v2.10.0 the planner was updated to reattempt the Windows install plans multiple times before marking them as failed, as there can be transient issues that are not representative of a true plan failure. The problem I encountered was that, if a plan failed to apply 3 times but then succeeded, it would only be reattempted two times after the next reboot as the failure-count would still be equal to 3.

Handling this situation was complicated by an issue in rke2 which resulted in calico HNS namespaces being deleted each time rke2 was restarted (typically via the one-time-instruction). In that case, the plan should not be reattempted. If it was, the node might eventually be marked as available, even though some behavior (like deleting pods) would be completely broken. The solution was to introduce this field and conditionally set it based off of the clusters k8s version.

I think we still need to consider that situation. The rke2 fix was delivered in August of last year, so users have had plenty of time to upgrade, but removing this field and changing the default behavior could potentially silently break some existing clusters. I would be in support of doing that for 2.12, and communicating it in the release notes.

The existing change for applied-checksum shouldn't run into the above issue though.

HarrisonWAffel

Approving from a functional perspective. It should not be 100% required to remove the ResetFailureCountOnStartup in order to resolve this issue on linux nodes

pratikjagrut force-pushed the retry.onetime.instruction branch from 1d3de06 to 15b7568 Compare March 26, 2025 22:50

pratikjagrut changed the title ~~[Draft]fix: add retry logic for one time instruction~~ fix: add retry logic for one time instruction Mar 26, 2025

pratikjagrut marked this pull request as ready for review March 26, 2025 23:36

pratikjagrut requested review from Oats87, jakefhyde and HarrisonWAffel as code owners March 26, 2025 23:36

pratikjagrut force-pushed the retry.onetime.instruction branch from 15b7568 to 845d8b0 Compare March 27, 2025 20:03

HarrisonWAffel requested changes Apr 1, 2025

View reviewed changes

fix: add retry logic for one time instruction

dcb804b

pratikjagrut force-pushed the retry.onetime.instruction branch from 845d8b0 to dcb804b Compare April 4, 2025 19:52

jakefhyde requested changes Apr 7, 2025

View reviewed changes

HarrisonWAffel approved these changes Apr 29, 2025

View reviewed changes

jakefhyde self-requested a review April 29, 2025 17:20

jakefhyde approved these changes Apr 29, 2025

View reviewed changes

jakefhyde merged commit c561fbc into rancher:main Apr 29, 2025
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add retry logic for one time instruction #226

fix: add retry logic for one time instruction #226

Uh oh!

pratikjagrut commented Mar 25, 2025 •

edited

Loading

Uh oh!

HarrisonWAffel left a comment

Uh oh!

pratikjagrut commented Apr 7, 2025

Uh oh!

jakefhyde left a comment

Uh oh!

jakefhyde Apr 7, 2025 •

edited

Loading

Uh oh!

HarrisonWAffel Apr 7, 2025 •

edited

Loading

Uh oh!

HarrisonWAffel left a comment

Uh oh!

Uh oh!

Uh oh!

fix: add retry logic for one time instruction #226

fix: add retry logic for one time instruction #226

Uh oh!

Conversation

pratikjagrut commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Testing

Uh oh!

HarrisonWAffel left a comment

Choose a reason for hiding this comment

Uh oh!

pratikjagrut commented Apr 7, 2025

Uh oh!

jakefhyde left a comment

Choose a reason for hiding this comment

Uh oh!

jakefhyde Apr 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HarrisonWAffel Apr 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HarrisonWAffel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

pratikjagrut commented Mar 25, 2025 •

edited

Loading

jakefhyde Apr 7, 2025 •

edited

Loading

HarrisonWAffel Apr 7, 2025 •

edited

Loading